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Abstract 

We prove that a wide class of models of Markov neighbor-dependent substitution 
processes on the integer line is solvable. This class contains some models of nucleotide 
substitutions recently introduced and studied empirically by molecular biologists. We 
show that the polynucleotide frequencies at equilibrium solve explicit finite-size linear 
systems. Finally, the dynamics of the process and the distribution at equilibrium 
exhibit some stringent, rather unexpected, independence properties. For example, 
nucleotide sites at distance at least three evolve independently, and the sites, if encoded 
as purines and pyrimidines, evolve independently. 

Introduction 

In simple Markov models of nucleotide substitution processes, one assumes that each site 
along the DNA sequence evolves independently of the other sites and according to some 
specified rates of substitution. We introduce some notations, which are quite common for 
the biologist and useful to describe conveniently these models, as well as, later on, the more 
sophisticated ones which are the subject of this paper. For these definitions and, later on 
in the paper, other nomenclatures, see [4j. 

Definition 1 The nucleotide alphabet is 

A:= {A, T, C, G}. 

These letters stand for Adenine, Thymine, Cytosine and Guanine, respectively. The nu- 
cleotides A and G are purines, often abbreviated by the letter R, the nucleotides T and 
C are pyrimidines, often abbreviated by the letter Y . The canonical projection ir on the 
purine / pyrimidine alphabet {R, Y} is defined by 

tt(A) := R=:it(G), tt(T) := Y =: tt(C). 

Substitutions of the form R — > R and Y — > Y are called transitions, substitutions of the 
form R — > Y and Y — > R are called transversions. Finally, for every subsets X and Z of 
A, it is customary to write XpZ for the collection of dinucleotides in X x Z. 



For instance, YpR dinucleotides are formed by a purine followed by a pyrimidine, hence 
there are 4 such dinucleotides, namely, CpG, TpA, TpG, and CpA. 

Experimental facts are that transitions are, in many cases, more frequent than transversions 
(typical ratios are 3:1), and that substitutions to C and to G occur at different rates than 
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substitutions to A and to T, see Duret and Galtier and the references in their paper. 
Many studies take this into account, for instance, in Tamura's model, one assumes that 
the matrix of substitution rates is 
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where vx, V2 and k are nonnegative real numbers. This means, for instance, that each 
nucleotide A is replaced by T at rate vi, by C at rate Vi, and by G at rate kvi, 

Under this model and under related ones with no interaction between the sites, the nu- 
cleotide attached to any given site converges in distribution to the stationary measure of 
the Markov chain described by the matrix of the rates and, at equilibrium, the sites are 
independent. This last consequence is unfortunate in a biological context, since the fre- 
quencies F(x) of the nucleotides and the frequencies F{xy) of the dinucleotides, observed 
in actual sequences, are often such that, for many nucleotides x and y, 

F(xy)^F(x)F(y). 

In fact, it is well known that the nucleotides in the immediate neighborhood of a site 
can affect drastically the substitution rates at this site. For instance, in the genomes of 
vertebrates, the increased substitutions of cytosine by thymine and of guanine by adenine 
in CpG dinucleotides are often quite noticeable (typical ratios are 10:1, when compared 
to the rates of the other substitutions). The chemical reasons of this CpG-methylation- 
deamination process are also well known and one can guess that, at equilibrium, the number 
of CpG is decreased while the number of TpG and CpA is increased when one adds high 
rates of CpG substitutions to Tamura's model. 

The need to incorporate these effects into more realistic models of nucleotide substitutions 
seems widely acknowledged. However, the exact consequences of the introduction of such 
neighbor-dependent substitution processes (in the case above, CG — ► CA and CG — > TG), 
while crucial for a quantitative assessment of these models, remain virtually unknown, 
at least up to our knowledge, on a theoretical ground. To understand why, note that 
the distribution of the value of the nucleotide at site i at a given time depends a priori 
on the values at previous times of the dinucleotides at sites % — 1 and i (to account for 
the CG — > CA substitutions), and at sites i and i + 1 (to account for the CG — > TG 
substitutions), whose joint distribution, in turn, depends a priori on the values of some 
trinucleotides, and so on. 

Duret and Galtier jH] introduced and analyzed a model, which we call Tamura + CpG, 
that adds to Tamura's rates of substitution the availability of substitutions CG — > CA 
and CG — ► TG, both at the additional rate g ^ 0. (These authors used a parameter 
Ki ^ k, related to our parameter g and to the ratio 8 := vi/(vi + V2), by the simple 
equation g = {1 — 9) (ki — «).) To evade the curse, explained above, of recursive calls 
to the frequencies of longer and longer words, Duret and Galtier used as approximate 
frequencies F(xyz) of the trinucleotides the values 

F(xyz)^F(xy)F(yz)/F(y). 

Interestingly, these approximations would be exact if the sequences at equilibrium followed 
a Markov model (with respect to the space index i). This is not the case but, relying on 
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computer simulations, Duret and Galtier studied the G + C content at equilibrium and 
other quantities of interest in a biological context, they compared the simulated values to 
the values predicted by their truncated model, and they showed that their approximations 
captured some features of the behavior of the true model. In particular, they highlighted 
that these models had inherent, and previously unexpected, consequences on the frequency 
of the TpA dinucleotide as well. We mention that Arndt and his coauthors considered 
similar models and their biological implications, see Arndt, Burge and Hwa Arndt ^ 
and Arndt and Hwa [3 J for instance. 

In this paper we introduce a wide extension of the Tamura + CpG model of neighbor- 
dependent substitution processes, which we call R/Y + YpR models, and we show that 
these models are solvable. More precisely, we prove that the frequencies of polynucleotides 
at equilibrium solve explicit finite-size linear systems. Thus, the infinite regressions to the 
frequencies of longer and longer polynucleotides described above disappear, and one can 
compute analytically several quantities of interest related to these models. For instance, 
one can assess rigorously the effect of neighbor- dependent substitutions. As noted above, 
the very possibility of such a solution comes as a surprise. Additionally, our analysis 
provides some stringent independence properties of these models at equilibrium, which, in 
our view, should make the biologist somewhat more reluctant to use them. 

Acknowledgements We wish to thank the biologists who contributed to this work, in 
particular Laurent Duret, Nicolas Galtier, Manolo Gouy and Jean Lobry. When the need 
arose, they willingly provided facts, references, and their numerous insights about the 
subject of this study. 

1 Description of the models 
1.1 R/Y models 

To explain the basis of our analysis, we first note a key property of Tamura's model, 
described by matrices of type (£Q) in the introduction. 

Property (a) One can complete the matrix of the substitution rates by diag- 
onal elements such that, on the one hand, the joint distributions of the elapsed 
times before a substitution occurs and of the nucleotide that this substitution 
yields coincide for the two purines, and, on the other hand, the joint distribu- 
tion of the elapsed times before a substitution occurs and of the nucleotide that 
this substitution yields coincide for the two pyrimidines. 

Definition 2 The R/Y models are the models of substitutions such that property (a) holds. 

Property (a) involves a comparison of some coefficients of the A and G lines of the matrix 
of substitution rates, and of its T and C lines, namely, the coefficients that correspond to 
the transversions. In the most general model such that property (a) holds, the matrix of 
substitution rates is 
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where the parameters v x are nonnegative. Here * signals some free coefficients, which cor- 
respond to the transitions. The following characterization of the R/Y models is immediate. 

Proposition 3 (R/Y substitution rates) A substitution model is R/Y if and only if 
there exists nonnegative rates v x and w x such that the matrix of substitution rates is 

A T C G 

A ( ■ vt vq wq \ 

T v A ■ w c v G 

C va wt ■ vg 

G \ w A v T v c J 

This means, for instance, that each nucleotide A is replaced by T at rate vt, by C at rate 
vc, and by G at rate wq- The rates v x and w x are indexed by the nucleotide x that the 
corresponding substitution produces. 

The full matrix of an R/Y model, which shows that property (a) holds, is 
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where v x and w x are nonnegative rates. We recall that the diagonal elements represent 
fictitious substitutions x — ► x, which leave the whole process unchanged. 

1.2 Situation of the R/Y models 

Here are some relationships between the R/Y model of substitutions and other classical 
ones, see [Sj. First, Tamura-Nei's model (usually abbreviated as TN93) corresponds to the 
restriction of R/Y such that 

wa vg = wg va, wt vc = wc vt- 

Hence special cases of TN93 are also special cases of R/Y. For instance, Felsenstein's model 
(F84) corresponds to the restriction of TN93 such that 

WA + Vt + V C + Wg = va + Wt + Wc + Vg- 

Hasegawa-Kishino-Yano's model (HKY85) corresponds to the restriction of R/Y such that 
w x /v x does not depend on x. In other words, both F84 and HKY85 are subcases of TN93, 
which is a subcase of R/Y. 

Kimura's model with two parameters (K2P, also known as K80) corresponds to the re- 
striction of R/Y such that v x and w x do not depend on x. Jukes-Cantor's model (JC69) 
corresponds to the restriction of R/Y such that v x = w x and v x does not depend on x. 
Tamura's model, as it appears in the aforementioned paper by Duret and Galtier, is inter- 
mediate between HKY85 and K2P since it corresponds to the restriction of R/Y such that 
va = vt, vc = vg, and w x jv x does not depend on x. 

We summarize this as a proposition. 
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Proposition 4 The model JC69 is a strict subcase of K80, which is a strict subcase of 
both HKY85 and F84, which are both strict subcases of TN93. All these, and Tamura's 
model in Buret and Galtier (2000), are strict subcases of R/Y. 

Finally, we mention that the general time-reversible model (GTR) is not comparable with 
R/Y, in other words some matrices of rates of substitutions are GTR but not R/Y, and 
vice versa. The intersection of GTR and R/Y is TN93. As a consequence, the complement 
of TN93 in R/Y contains only non-reversible models. 

1.3 YpR substitutions 

We generalize the CpG mechanism of substitutions considered by Duret and Galtier (2000), 
adding specific rates of substitutions from each YpR dinucleotide as follows. 

• Every dinucleotide CG moves to CA at rate r A and to TG at rate . 

• Every dinucleotide TA moves to CA at rate r^ and to TG at rate Tq. 

• Every dinucleotide CA moves to CG at rate rg and to TA at rate r^. 

• Every dinucleotide TG moves to CG at rate rg and to TA at rate r T A . 

The rationale for these notations is as follows. The rates r v x are indexed by the nucleotide x 
produced by the substitution, and by the nucleotide y completing the YpR dinucleotide xy 
that the substitution yields, when £ is a pyrimidine, or completing the YpR dinucleotide 
yx that the substitution yields, when x is a purine. In other words, y is the nucleotide not 
affected by the substitution. 

Definition 5 R/Y + YpR models correspond to the superposition of rates of substitutions 
of an R/Y model and of rates of substitution r y x of dinucleotides YpR, as described above. 

To recover the model of Duret and Galtier, one should assume that 

r C A = r$, 4 = r T G = rg = 4 = rg = r T A = 0. 

R/Y + YpR models use 16 mutation rates, namely, 4 parameters v x for the transversions, 
4 parameters w x for the transitions, and 8 parameters r x for the mutations involving YpR 
dinucleotides. To multiply these 16 parameters by the same scalar changes the time scale, 
but not the evolution itself nor the stationary distributions, hence one can consider that 
the class R/Y + YpR has dimension 15. We allow for possibly negative values of the rates 
r x and the model makes sense if the following inequalities are satisfied: 

v x >0, w x ^ 0, w x + r y x ^ 0. 

Finally, note that the rates r x describe transitions, and never transversions. 

1.4 Sets of sites 

DNA sequences are represented by sequences of letters of the alphabet A. Although real 
DNA sequences are obviously finite, their typical length is rather large, hence one often 
considers them as doubly infinite sequences of letters, that is, as elements of the set A z . 
Then sites on the DNA sequence are identified to integer numbers in Z. Our mathematical 
results are most conveniently expressed in this setting. However, as we explain below, most 
of them are valid for suitable finite sets of sites as well. 
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2 Description of the results 



Roughly speaking, our main findings about R/Y + YpR models at equilibrium are the 
following. 

• One can compute the exact value of the frequency of every polynucleotide by solving 
a finite-size linear system. 

• One can simulate exact samples of sequences of nucleotides. 

• Some surprising independence properties between sites hold. 

Furthermore, the dynamics satisfies the following. 

• The dynamics converges to the unique equilibrium, for every starting distribution. 
To be more specific, we prove the following results. 

Theorem A (Construction and general properties) There exists a unique Markov 
process (X(s)) s >o on A z associated to the transition rates defined in section^ Under a 
generic non- degeneracy condition (ND), stated in section^ this process is ergodic, that is, 
it has a unique stationary distribution [i and, for any initial distribution, X{s) converges 
to fx in distribution as s goes to infinity. Moreover, \i is invariant and ergodic with respect 
to the translations in Z. 

From theorem EI equilibrium properties of the model are well-defined, for instance the 
equilibrium frequency of polynucleotides. 

Theorem B (Dynamics) There exists an i.i.d. sequence of marked Poisson point pro- 
cesses on the real line, indexed by the sites i, and a measurable function <1> with values 
in A such that, for every couple of times s ^ i and every site i, the value Xi(t) of site i at 
time t is 

Xi(t) = $(x l _ 1 (s),Xi( s ),x l+1 (s);^_i n [s,t},d n MUm n [s,t]). 

Theorem C (Statics) Assume that the distribution of (Aj)j is stationary. There exists 
an i.i.d. sequence of marked Poisson point processes on the real halfiine, indexed by 
the sites i, and a measurable function \P with values in A such that, for every site i, 

Theorems El and O describe some structural properties, at the basis of our subsequent 
results. We begin with some direct consequences. 

Definition 6 For every subset I of Z, let Adj(J) denote the set of integers i such that 
either i — 1 or i or i + 1 belongs to I. 

Proposition 7 (Dynamics) Consider sequences indexed by I. The restriction of the dy- 
namics to a subset of sites I does not depend on I, as soon as I contains Adj(J). 
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Proposition shows for instance that the behavior of the sites in {1, . . . , n} is the same, 
whether one considers that these sites are embedded in a sequence indexed by Z, or in a 
sequence indexed by {0, 1, . . . , n + 1}. If the sites are embedded in {0, 1, . . . , n + 1} or in a 
larger finite set, this means that the boundary conditions have no effect on the behavior of 
the sites in {1, . . . , n}, hence one can consider at will discrete intervals or discrete circles. 
For instance, one can decide, either that the only neighbor of is 1 and the only neighbor 
of n + 1 is n, or that has neighbors 1 and n + 1 and that n + 1 has neighbors n and 
0. This decision will modify the evolution at the sites and n + 1 but not at the sites in 
{1, . . . , n}. This remark concerns all the results that we state later on in this section. 

We come back to the consequences of theorems [E] and IU1 

Corollary 8 (Statics) Assume that the distribution of (Xi)i is stationary. Fix some sets 
of sites Ik such that the sets Adj(Jfc) are disjoint. Then the families (Xi)i E i k are indepen- 
dent from each other. 

The condition in the corollary means that, for every k ^ k', \i — i'\ ^ 3 for every i in Ik 
and every i' in Iy, For instance, the sequence {X-a)i^ is i.i.d. at equilibrium. 

Our next results deal with what is probably the main concern of biologists in relation to 
this model, namely, the actual computation of some equilibrium frequencies. 

Theorem D The equilibrium frequencies of polynucleotides solve explicit finite-size linear 
systems and can be expressed as rational functions of the parameters of the model. 

Theorem E (Nucleotides and YpR dinucleotides) The frequency of each nucleotide 
at equilibrium can be expressed explicitly as an affine function of the equilibrium frequencies 
of the YpR dinucleotides. Furthermore, the equilibrium frequencies of the YpR dinucleotides 
solve an explicit 4x4 linear system. 

We state theorem El more precisely as theorem [01 in section 18.21 Our last result is a 
consequence of the inner properties of our basic construction. 

Theorem F (R/Y sequences) Encode the sequence of nucleotides as an R/Y sequence 
of purines and pyrimidines. If the sequence of nucleotides at equilibrium, then one obtains 
an i.i.d. R/Y sequence with weights ta and ty, and 

V C + V T V A + V G 
ty '■= , tfi := . 

VA + Vt + Vc + Vg VA + Vt + Vc + Vg 

In particular, the values of ty and tR do not depend on the values of the YpR substitution 
rates. This fact reflects, once again, the specific property of R/Y + YpR substitution 
models that is at the basis of our analysis, namely, that the global model is equivalent to 
the superposition of the double substitution processes described by the rates r v x on top of 
the simple substitution processes described by the rates v x and w x , see section 0J This 
remark shows that one cannot compute analytically, at least along these principles, the 
stationary measure of substitution models that are not in the R/Y + YpR class, and in 
fact, we suspect that one cannot compute it at all. 
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3 Description of the paper 



Here is a moderately detailed description of the content and organization of the rest of the 
paper. 

Part A is devoted to a rigorous construction of the processes described informally in sec- 
tion ^ We stress that one could rely on the general principles in Liggett chapter 1], 
based on infinitesimal generators, to build Markov processes on the space of finite or infi- 
nite nucleotide sequences that correspond to the jump rates defined above. However, in the 
present case, a direct construction of the dynamics is possible, which yields straightforward 
proofs of important structural properties of the process and possesses some interesting cou- 
pling properties. In section 01 we give some notations and definitions. In section we 
explain the construction of the process when the number of sites is finite. In section 
we deduce from this the case when the number of sites is infinite. Finally, section deals 
with the simplifications related to the encoding of the nucleotide sequences as sequences 
of purines and pyrimidines. 

Part B is devoted to the actual computation of some quantities of interest for the model 
at equilibrium. From structural properties of the process, described in part A, computing 
the equilibrium frequencies of polynucleotides amounts to solving finite-size linear systems. 
Moreover, one can express the nucleotide frequencies as functions of the YpR dinucleotide 
frequencies, and these, in turn, solve a 4 x 4 linear system. Section |H1 explains this in the 
general case. 

The remaining sections of part B are devoted to restricted versions of the model, which 
involve a reduced number of free parameters. This avoids, first, cumbersome formulas 
that would depend on a large number of free parameters, and second, the prohibitive 
computational burden involved in symbolically solving large linear systems. Section |H1 
deals with uniform simple substitution rates and double substitution rates from CpG and 
TpA only. Section ^] deals with models that are invariant by the classical symmetry of 
DNA strands. This case has biological significance since the invariance reflects the well 
known complementarity of the two strands of DNA molecules. Section deals with 
the simplest non trivial version of the R/Y + YpR model, namely, the case when all 
the simple substitution rates coincide and there are no double substitutions except from 
CpG to CpA and TpG, both at the same rate. In this setting, we provide the exact 16 
dinucleotide frequencies (that is, not only the 4 YpR frequencies), a perturbative analysis of 
the frequency of every polynucleotide at vanishing CpG substitution rates, and we describe 
the non degenerate limit of the model at high CpG substitution rates. The relatively shorter 
section El explains the dynamics of the model, that is, its evolution to the stationary 
measure. The results of this section are valid in our general setting but we expose them in 
the simplest case. 

Part C deals with the actual simulation of these systems. Coupling-from-the-past tech- 
niques are pivotal here, as explained in section EH Section El delves into the details of the 
effective implementation of the CFTP method in the context of R/Y + YpR systems, and 
provides the basic schemes of two different algorithms that perform this sampling method. 
Once again, the crucial fact here is that one can use finite-size sets of sites to simulate the 
behavior of smaller finite-size sets of sites. 

Finally, section provides an index of the main parameters and notations used in the 
paper. 
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Part A Construction 



4 Notations and definitions 

For every topological space E and every real number <r, V(a, E) denotes the space of 
cadlag (right continuous with left limits) functions from [a, +00) to E, equipped with the 
Skorohod topology and the corresponding Borel cr-algebra. 

Let I denote the collection of nucleotide sites, thus I may be either the integer line Z or 
a finite interval of integers. Due to technical reasons that will become apparent when we 
explain the construction of the dynamics, I must contain at least 3 sites. 

We recall from definition ^ in the introduction that A := {A, T, C, G} denotes the nucleo- 
tide alphabet, that A and G are purines, encoded by R, that C and T are pyrimidines, 
encoded by Y, and that the mapping n is defined on A by 

ir(A) := R =: ir(G), 7r(C) := Y = : tt(T). 

We now define other mappings on A. 

Definition 9 Let p denote the application which fuses the two purines together, and r\ the 
application which fuses the two pyrimidines together, that is 

p(A):=R=:=p{G), p{C) := C, p(T):=T, 

and 

V (A):=A, V (G):=G, V (C) := Y =: V (T). 

For every nucleotide x in A, let x* denote the unique nucleotide such that {x, x*} = {A, G} 
or {x,x*} = {C, T}. In other words, x 1— > x* is the involution of A such that 

A* := G, T* := C, C* := T, G* := A. 

For every subsets K and J of the integer line such that J C K, let Tl J K denote the canonical 
projection from A K to A J . When J = {a, . . . , b} for two integers a ^ b, we often omit 
the mention of K and write H a,b for 11^. In other words, if K contains {a, . . . , b} and if 
x := (x k ) keK , then 

n a < b (x) := (x k ) a ^ b . 

Let S + denote the set of countably infinite locally finite subsets of the real line R, and let 

5 := S + U {0}. We equip S with the usual c-algebra in the context of point processes, 
namely, the smallest cr-algebra such that, for every Borel subset A of the real line, the 
function iV 1— > card(A r n A) is measurable. 

We consider collections £ of elements of 5, defined as follows. Let ^4o denote the disjoint 
union of five copies of A, say 

Aq := Au U Ay U A w U Ar U Aq. 

Let f^o denote the space 

no:=5^° XI . 

We call J^o the product cr-algebra on f^o inherited from that of S. Let £ be a generic 
element of Qq. Hence £ may be written as 

£ =: (£i)iel> where each £j belongs to S Ao . 
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For every x in A, let Uf(£) denote the x-co ordinate of £j in Au, hence Uf (£) belongs to 
S. Likewise, Vf(£), Wf(£), 1Zf(£), and Qf(£) respectively denote the rc-coordinates of ^ 
in Ay, Aw, Ar, and Aq, and belong to S as well. In other words, for every £ in and 
every z in I, 

Recall that, for every real number r, the positive part r + and the negative part r~ of r are 
both nonnegative and defined by the relations r = r + — r~ and |r| = r + + r _ . 

Definition 10 For ewen/ nucleotide x in A, the combined rate of substitution c x is the 
nonnegative real number defined as 

c x := w x - m&x{( r y)-, (rf )"}, 

where {y,y*} = {C,T} if ir(x) = R and {y,y*} = {A, G} if ir(x) = Y. 

Finally, the probability measure Q on (Ooj-^o) is such that, for every site i in I and every 
nucleotide x in A, 

• Uf is a homogeneous Poisson process on the real line with constant rate mm(v x ,c x ), 

• Vf is a homogeneous Poisson process on the real line with constant rate (v x — c x ) + , 

• Wf is a homogeneous Poisson process on the real line with constant rate (c x — v x ) + , 

• IZf is a homogeneous Poisson process on the real line with constant rate \r x \, where 
the rate r x corresponds to YpR substitutions starting from CpG or TpA, 

• Qf is a homogeneous Poisson process on the real line with constant rate \r x \, where 
the rate r x corresponds to YpR substitutions starting from TpG or CpA. 



Thus, Q is uniquely specified by the following additional condition. 



• The Poisson processes Uf , Vf , Wf , 7£f , and Qf , for every site i and every nucleotide 
x, are independent. 

One sees that every Uf U Vf is a homogeneous Poisson process with constant rate v x and 
that every Uf U Wf is a homogeneous Poisson process with constant rate c x . 

We now provide a brief and intuitive description of the construction of the dynamics of the 
process, using these Poisson processes. As is usual, the points in the processes £j are the 
ringing times of exponential clocks that rule the evolution of the sites. There exists five 
types of moves, labelled as U, V, W, R, and Q. 

• Type U. When an exponential clock attached to Uf rings, the nucleotide at site i 
moves unconditionally to the value x. 

• Type V. When an exponential clock attached to Vf rings, the nucleotide at site i 
moves to the value x provided that this move corresponds to a transversion. 



Type W. When an exponential clock attached to Wf rings, the nucleotide at site i 
moves to the value x provided that this move corresponds to a transition. 
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• Type R. When an exponential clock attached to TZf rings, the nucleotide at site 
i moves to the value x in the following cases: if this move corresponds to a YpR 
substitution from CG or TA when the associated rate r% 0, and if this move 
corresponds to a transition but not to a substitution from CG or TA when r% < 0. 

• Type Q. When an exponential clock attached to Qf rings, the nucleotide at site 
i moves to the value x in the following cases: if this move corresponds to a YpR 
substitution from TG or CA when the associated rate r% ^ 0, and if this move 
corresponds to a transition but not to a substitution from TG or CA when r v x < 0. 

The rates of the Poisson processes are chosen in order to couple as strongly as possible the 
transitions and the transversions that yield the same nucleotide and to take properly into 
account the inhibitory effect of YpR mutations when some rates r% are negative. 

We introduce a subset of Qq, defined by the following conditions. 

• The sets Vf are disjoint, for every nucleotide x, every site i, and for every symbol V 
in the set {U,V,W,K, Q}. 

• For every site i, there exists a symbol V in the set {U, V, W, 1Z, Q} and a nucleotide 
x in A such that the set Vf is infinite. 

We also introduce a non-degeneracy condition. 

• (ND) For every nucleotide x in A, v x and c x are positive. 

Under condition (ND), Q(f\) \ ^i) = 0. To avoid the handling of tedious exceptions, we 
assume from now on that condition (ND) holds and we work exclusively on Qi, equipped 
with the Borel cr-field and the probability measure that are induced by those of (Qq, J^o, Q). 
We denote this new probability space by (Qi,.Fi,Q). 

5 Construction on finite intervals 

Before considering the full integer line, we define the dynamics on finite discrete segments, 
with periodic boundary conditions. The choice of boundary conditions is somewhat arbi- 
trary, and one could use instead free boundary conditions or fixed boundary conditions. 
However, the dynamics with periodic boundary conditions is invariant by the translations 
of the discrete circle, and this fact turns out to be quite useful since it reduces the dimension 
of the linear system which yields the invariant distribution, see section El 

5.1 Definitions and notations 

In this whole section we fix two integers a and b such that a + 2 ^ b and we assume that 

I:= {a,. ..,&}. 

The definitions below depend on the choice of I but, to alleviate the notations, we do not 
always mention explicitly the dependence. 
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For every site i in I, we introduce as the neighbor of i to the left of i and r(z) as the 
neighbor of i to the right of i. More precisely, 

:= i — 1 if i 7^ a, 1(a) := b, r(i) := i + 1 if i 7^ 6, r(i) := a. 

Since a + 2 ^ 6, for every site i, the sites i, l(i) and r(i) are three different sites. 

Fix a sequence x = (xj)iei in .A 1 and a site i in I. Our next definitions are related to the 
moves of types Q and R, defined in section QJ Say that: 
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of 
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to 
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site i 
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= TG. 
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substitutions 
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type 
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to 
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at 


site i 


if XiX r u\ 


= CA. 



Although this terminology makes little concrete sense when some rates r y x are negative, we 
use it even then. 

5.2 Construction 

The goal of this section is to define, for every real number a, a measurable map 

ifl : A 1 x Oi ^V{a,A l ). 

Intuitively, each function <£>*(x, £) describes the dynamics in A 1 that starts from the con- 
figuration x at time a and uses the moves prescribed by the realization £ of the Poisson 
processes. To be specific about the notations, we write 

¥^(x»0(«) = (¥£( x >£.»>*)) ieI - 

In other words, ¥v( x ) £, i, s) stands for the ith coordinate of the value of the function 
¥v( x > £) a ^ time s ^ cr, hence (x, £, i, s) belongs to A. 

From now on, we fix £ in J7i and x in A 1 , and, to alleviate the notations, we omit to 
mention the dependence with respect to a, b, a or £ of various quantities. Introduce 

T:=\jT(i), T(i):=(a,+oo)n IJ^fUVf UWf UTefUQf. 
iel zeA 

Let t_i := cr and (to, tl, • • • ) denote the ordered list of points in T, that is, 
a < to < h < ■ ■ ■ , and T =: {t , ti, t2, • • •}• 
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For every n ^ 0, let c n denote the site where the nth move occurs that affects a site in I 
after the time a, and let M n denote the description of this move. That is, 

c n = i and M n = (z, U) if t n belongs to Uf . 

Likewise, c n = i and M n = (z,V), M n = (z,W), M n = (z,R), and M n = (z,Q) respec- 
tively, if t n belongs to Vf, Wf, TZf, and Qf respectively. We often consider that M n belongs 
to Ao, for instance M n = (z, U) may be identified with z in Au- Alternatively, the second 
component of M n is considered as a flag, it takes its values in the set {U, V, W, R, Q}, and 
it is often denoted by /. In any case, the variables c n and M n are uniquely defined for 
every £ in 

We now define a map 

71 : A 1 x A x I -> A. 

Fix a sequence x := (xj)j 6 i in A 1 , a move description M = (z,f) in Ao, and a site i in 
I. The definition of the i-coordinate of 71 (x, z, /, i) differs from the definition of the other 
coordinates. More precisely, one sets 

7i(x,z,/,i)j := z, 

if one of the following conditions is met. 

• The flag / = U. 

• The flag / = V, and Xi is a purine and z is a pyrimidine, or vice versa. 

• The flag / = W, and Xi and z are both purines or both pyrimidines, 

• The flag f = R and the type R rate rf > and x accepts the substitutions of type 
R to z at site i. 

• The flag f = R and the type R rate r y z < and x does not accept the substitutions 
of type R to z at site i and Xi and z are both purines or both pyrimidines. 

• The flag f = Q and the type Q rate rf > and x accepts the substitutions of type 
Q to z at site i. 

• The flag f = Q and the type Q rate rf < and x does not accept the substitutions 
of type Q to z at site i and Xi and z are both purines or both pyrimidines. 

In every other case, that is, if j = z and none of the conditions above is met, or if j 7^ i, 
one sets 

ji(yL,z,f,c)j := xj. 

This defines the map 71. We now construct the map ip^, using an induction over increasing 
values of the time s ^ a. The initial condition is that, for every a ^ s < to, 

</4( x ,0( s ) : = x - 

Assume now that y*(x, is well-defined for every time s such that a ^ s < t n . Then, 
for every time s such that t n ^ s < t n+ \, one sets 

v£(x,0(*) :=7i(^(x,6(*n-i),M n ,c n ) . 

This defines a configuration (x, £)(s) for every time s ^ a. Since I is finite, measurability 
issues are obvious here. 
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5.3 First properties 



Recall that in this section 03 I denotes the finite interval I = {a, . . . , b}. We record some 
immediate properties of the family of maps ip*, making use of still another notation. 

Definition 11 For every time set T, let £T := (£jT)j e i, where 

iiT := (u?(0 n T, v?(£) n t, Wf (0 n T, Kf(0 n t, Qf (£) n r)^. 

Proposition 12 (1) For even/ s ^ i > cr ; 

^(x,0W = vJ W(x,O(*),0 (-)• 

(2) For even/ s ^ ane? every rea/ number t, 

<fl+t (x.,S + t)(a + t + s) = <pl(x.,Z)(a + s), 

where £ + i denotes the result of the addition oft to each component of £. 

(3) Finally, y>*(x, depends on £ only through £[a,s]. 

The function (/?q( x ) •)> defined by 

s i ► ^(x, •)(«), 

is a random variable on (0,%, J 7 !, Q) with values in 2?(0, .A 1 ). Let P* denote its distribution. 
Then, from proposition El and from the translation invariance of Poisson processes, the 
family of measures 

defines a Markov process in the sense of Liggett jj3 chapter 1]. Moreover, it corresponds 
to the specification of the process on A 1 given in our section ^ in terms of transition rates, 
with periodic boundary conditions. 

From now on, we use the notation 

Xi(s) := <^(x, •)(*). 



Proposition 13 For every finite interval I and every initial configuration x, the Markov 
process (X*(s)) s ^o is ergodic. In other words, there exists a unique stationary distribution 
\i\ on A , and for every x in A , X^(s) converges in distribution to /ij when s — * +oo. 

Proof of proposition 1131 

Immediate since (X*(s)) s ^o lives on a finite state space and is irreducible from the non- 
degeneracy assumption (ND) at the end of section^ □ 
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5.4 Dependencies 

We now make a fundamental observation. 

Proposition 14 For every initial sequence x := (a?,)j 6 i, every site i inl and every time 
s ^ a, the nucleotide y>*(x, £, i, s), which depends a priori on all the information contained 
in x ane? is in fact measurable with respect to the following restricted initial conditions 
and restricted sources of moves: 

x\(i), Xi, x r (i), £,\(i)[o-,s], &[<r,s], 

More precisely, there exists measurable maps 6 CT)S , independent of a and b, and such that, 
for every integers a and b such that a + 2 ^ b, and for every a ^ i ^ b and s a, 

ip a a ' b (x, £, i, s) = 9 CT)S (x, {i) , Xi , x r{i) , [a, s] , & [a, s] , £ r(i) [a, s] ) . 
A straightforward consequence of proposition fTH is proposition fT5l below. 

Proposition 15 Let a, b, c\, d\, C2 and d<i denote integers such that 

c k + 1 < a ^ b ^ d k - 1, fc = 1, 2. 

Fia; some initial conditions xi in andx.2 in A^ C2 ''"' d2 ^ that coincide on the interval 

{a — 1, a, . . . , b, b + 1} ; that is, such that 

n a -W( Xl ) =n a - 1 '*- 1 (x 2 ). 

Then, for every s ^ a, 



5.5 Proofs 

The proof of proposition [Til relies on lemma ITKl below. Recall definition |H1 of p and ry. 

Lemma 16 For every site % inl and every time s a, the functions 

Q [^( X >C, '(«)>*)] i y£(x> »7 [p*(x,f,r(*),s)] , 

which depend a priori on all the information in x ana 1 £, are m /ac£ measurable with respect 
to the following restricted initial conditions and restricted source of moves: 

Q{x\(i)), Xi, r)(x r (i)), (,\(i)[o-,s\, £i[(T,s], Zr(i)[(T, s]. 

More precisely, one may define a measurable map ^ a ,s, independent of a and b, and such 
that, for every integers a and b such that a + 2 ^ b, every a ^ i ^ b, and every s ^ a, the 
triple 



Q 



<> 6 (x, e, i(o, a)] , ^- fc (x, e, i,s), v [^ 6 ( Xj e, ko, «)] ) 



coincides with 

^v,s (Q(x\ty),xi,7)(x r (i)),€\(i,)[a, s],£i[a, s],£ r(i) [cr, s]) . 
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Proof of lemma FTTH 

Fix a source of moves £ in Q\, an initial configuration x in A , a site i in I, and let 

r* = T(i(<))ur(i)ur(r(i)), 

where the sets T(-) are defined in section PT^I Let (tj, , ■ - • ) denote the ordered list of 
points in T*, that is, 

T* = {tl,tl,tl,...} where a < t* < t\ < ■ ■ ■ , 

and set t* 1 = o\ Then, for all n ^ 0, we describe the move that occurs at time t* through 
the description M* = /*) of this move and the site c* where the move occurs, as 
defined in section f5~2l Note that the moves that affect the sites i and r(i), can only 
occur at one of the times i* for n ^ 0. 

Thus, to prove lemmaE3 it is enough to prove that, for all n ^ —1, 
depend only on M* +1 , on c* +1 , and on 

To this aim, we first assume that c* +1 = i and we examine the value of the flag 

• If fn+i = U, V or W, by the very construction of the process, (x, £, i, is 
determined by (x, £, i, t* ), /* and z* n . 

• If = R or Q, ^(x,£,i,t; +1 ) is determined by (x, £, i, t* ), by 2*, and by the 
knowledge of whether or not the sequence accepts the R or Q substitutions to z* n at 
site i. In turn, this only depends on 

efcJ-fofi'COjO] > y£( x >£,Mn)> |V*( x >f> r (*)>*n)] • 

This settles the case when c* +1 = i. Now we assume that c* +1 = l(i) and we examine the 
value of the flag f^+i- 

• If fn+i = U, V or W, y?*(x, £, depends on and on whether the nu- 
cleotide <^ct( x ) £> '(0>*n) * s a P urme or a pyrimidine. This information is provided by 

et^fc&K*),**)]. 

• If = R ox Q and z* = C or = T, whether the sequence accepts the substitu- 
tions of types i? or Q at site l(i) depends only on g (x, £, l(i), i* )] and (x, £, i, t£). 

• If = R or Q and z* n = A or z* n = G, we observe that the corresponding substitu- 
tion of types i? or Q at site l(i) could only turn an A to a G or vice versa. This does not 
affect the value of g [<p*(x, £, t^+i)] , which must be equal to g (x, £, l(i),i*)] . 

This settles the case when c* +1 = l(i). Since symmetric arguments hold when c* +1 = r(i), 
this concludes the proof. □ 

Proof of proposition 1151 

Let n a-1,b+1 (xfc) := (xj) a _i^j^5 + i. The result follows from proposition ITU since, for every 
i in {a, . . . , 6}, 

^• dl (xi,£,M) and ^ 2 ' d2 (x 2 ,£,i,s) 
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are both equal to 



®<t,s i,Xi,Xi+i,£i-i [a, s] , & [a, s] , £ m [a, s] ) . 



□ 



6 Construction on the integer line 

The construction of the process on the integer line Z relies crucially on a consequence of 
proposition El above, namely, the fact that, for every initial sequence x, every site i in Z, 
every source of moves £, and every couple of times s ^ a, the value of 

^• 6 (n a > 6 (xU,M) 

does not depend on a and b as soon as 

Hence the projective limit of the system b )a,b when a — ► — oo and b — > oo exists trivially 
and defines a mesurable map 

Some previous observations about <£v' 6 translate immediately to 
Proposition 17 (1) For every s ^ t ^ a, 

(2) For ewen/ s ^ and t, 

$ a+ t(xi,Z + t)(o- + t + s) = $ a (x,0((T + s). 

(3) Finally, <J?o-(x, £)(s) depends on £ cm/?/ through £[o~,s] = s])iez- 
For every x in „4 Z , let P x denote the distribution of 

s ^ $o(x,0O), 

viewed as a random variable on (f2i , ^"i , (Q) with values in T>(0,A z ). Then, as can be 
checked from propositions ED and El using the translation invariance of Poisson processes, 
the family 

defines a Feller Markov process in the sense of Liggett (HI chapter 1]. From now on, we use 
the notation 

X s x = $ (x, •)(*), 

and we sometimes omit the initial condition x of the Markov process (X s ) s ^q. 

It is straightforward to check that the construction in [JJJ, based on infinitesimal generators, 
yields the same process. Let us call Q the infinitesimal generator of the process yielded by 
the construction in [H] , then Q is well-defined and explicitly known at least for the Lipschitz 
functions on A z . 
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Let (X][(s)) s >o denote the Markov process on denned by 

xW \ A else. 

By definition, X*(s) converges to Xf as I — > Z. On the other hand, since the process 
(X^(s)) s >o involves moves only on the finite set of sites I, its infinitesimal generator Q\ 
can be readily computed. Moreover, Q\ converges to Q as I — > Z, at least on the set of 
real valued Lipschitz functions defined on A z . This is enough to identify (X s ) s ^q with the 
process yielded by the construction of jS], according to Corollary 3.14 in [§]. 

A consequence of proposition El above is the following result. 

Proposition 18 For every sequence x := (xi)^, every integer i and every couple of times 
s ^ a, $cr(x, £, i, s) depends on x and £ only through 

Xi-i, Xi, x i+ x, &_i[<7,s], &[er,s], £ i+ i[<j,s]. 

Indeed, using the function O^ defined in proposition [7^1 

$ CT (x, f , i, s) = e CTiS , , a? i+ i , Ci-i [cr, s] , & [o-, s] , &+i [(J, s) ) . 

A simple consequence of proposition ^] is the following proposition. 

Proposition 19 Fix x and some subsets I of the integer line at distance at least 3 from 
each other, that is, such that for every such pair I ^ I' of subsets and every sites i in I 
and i' in I', \i — i'\ ^ 3. Then, the collections [(Xf^o)i\ ie j are independent. 

Here are some simple examples. 

• The collections [(X s -^o)i] i<0 and [(X s ^o)i] i>3 are independent from each other. 

• The collection [(X s ^o)4i> PCs^o)4i+i]j e z ' s ma de of i.i.d. random variables with values 



m 



A 2 . 



The remaining values form a collection [(X s ^o)4i+2j (^s^o)4i+3]j g 2 > wmcri is also 
made of i.i.d. random variables with values in A 2 , with the same distribution as the 
collection in the preceding item, while being not independent from it. 

The collection [(X s ^o)si] ieZ , is made of i.i.d. random variables with values in A. 

For every £ ^ and every 5 ^ £ + 3, the collection [(-X s ^o){Si,...,Si+£}] igZ j is made of 
i.i.d. random variables with values in A e+1 . 



Proposition 20 There exists a unique stationary distribution of (X s ) s ^q on A . 

Definition 21 Let \x denote the stationary distribution of (X s ) s ^q. Let n a ^ denote the 
measure m whose existence is ensured by proposition \l,°A when I = {a, . . . , b}. 



A consequence of the uniqueness of // is its invariance by the transformations that preserve 
the dynamics, for instance the translations of Z. Likewise, fjL a ^ is invariant with respect to 
the translations of the discrete circle {a, . . . , b} = Z/(6 — a + 1)Z. Moreover, if {a, ... , b} 
is the image of {c, . . . , d} by a translation of Z, coincide with the image of fj, Ct d by this 
translation. Other properties are in proposition 1221 
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Proposition 22 (1) Assume that the integers a, b, c and d are such that 

c+ 1 ^ a < b < d- 1. 

Then, 

(2) The Markov process (X s ) s ^q is ergodic. That is, for every initial condition x in A 1, , 
Xf converges in distribution to fi as s — > oo. 

(3) Finally, the independence properties stated above hold with respect to fi as well. 
The considerations above justify the following definition. 

Definition 23 (Stationary frequencies) Define the stationary frequency F(xi ■ ■ ■ x k ) of 
every polynucleotide x\ • • • Xk as 

F(x! ■■■x k ) := no tk+ i(A x {(xi, x k )} x A). 
Hence, F(x\ ■ ■ ■ Xk) is also 

F( X1 ■ "Xk) = /J- x{(x 1 ,...,xk)}x ^*+ 1 ^+ a -->) . 

Due to the independence properties of \i stated in proposition |22l \i is clearly ergodic with 
respect to the translations in Z, so we may as well define stationary frequencies as: 

1 b 

F(xi---x k )= lim —) l(X i+1 -"X i+k = xi-"X k ) 

a— >-oo,ft^+oo b — a + 1 

i=a 

where the distribution of (-Xj)jgz is fJ>, 1(^4) denotes the indicator function of the event A, 
and the above limit holds in the almost sure sense. 

Proof of propositions 1201 and 1221 

Fix a and b such that a ^ b. From proposition El for every x in A z and every s ^ 0, 

u a,b p^x) = u a,b ^ x a-l,b+l^ 

Now, according to proposition El X% 1,6+1 (s) converges in distribution towards fi a -i,b+i 
as s — > oo. As a consequence, H a,b (Xf) converges to Tl a,b ([i a -i : b+i)- Proposition [T5l again 
shows that 

TCI, ft 



n a ' {fla-l,b+l) , , 
/ a^fa 

is a coherent family of probability distributions. Hence there exists a unique probability 
distribution \i on A z such that n°- 6 (/i) = n a > b (/j, a -1,6+1 )• The convergence of H a ' b (Xf) in 
distribution towards Tl a,b (fi) for every a ^ b implies that, for every x in A z , Xf converges 
in distribution to fj,. The existence and the uniqueness of an invariant distribution, equal 
to fj,, follow easily. The other properties are obvious. □ 



7 R/Y encodings 

The situation of the R/Y process is even simpler. 
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7.1 On finite intervals 



Definition 24 Introduce the R/Y configuration at time s as the {R,Y} -valued function 

Zl(s) := 7r((/£(x,£,i,s)). 

Proposition 25 For every sequence x := (xi)i e i, every site i in I and every time s ^ a, 
the value of Z^.(s), which depends a priori on all the information in x and is in fact 
measurable with respect to 7r(xj) and £[cr, s]. 

Proof of proposition 1251 

The substitutions associated to clocks of types W, R and Q do not change the value of 
Z\(s). Hence the moves of the function s \— > Z\{s) are determined by the clocks U{ and Vj. 
Erasing the IZi and Qi clocks is like setting every rf to 0, in which case the sites evolve 
independently This proves the proposition. □ 



Corollary 26 For every collection of sites I ; the coordinates of 

are independent and each converge in distribution, when s —> oo, to the distribution on 
{R, Y} equal to ty <5y + t R 5 R , that is, 

Pi((Z s ) t = Y) -> t Y , Fl((Z s )i =R)^ t R . 

We recall from theorem El in section [21 that ty + tn = I and 

v c + v T . v A + v G 

ty := , ir := . 

VA + Vt + Vc + Vg VA+ Vt + V C + Vg 

7.2 On the full line 

Proposition [23 below is a straightforward consequence and equivalent to theorem El in 
section [21 

Proposition 27 There exists a unique stationary distribution of(Z s ) s ^o on {R, Y} z . This 
measure is the product mesure v® 1 * , where 

u(Y) := ty, u{R) := t R . 

Theorem El yields relations, which hold irrespective of the values of the mutation rates r%, 
namely 

F(CG) + F(CA) + F{TG) + F(TA) = tyt R , 
F(CC)+F(CT)+F(TC)+F(TT) = t 2 Y , 
F{AC) + F{AT)+F{GC)+F{GT) = t R t Y , 
F(CC) + F(CT) + F(TC) + F(TT) = t 2 R . 

These relations are simple enough to write. However, they yield awkward formulas for the 
individual frequencies of nucleotides and dinucleotides, in full generality. 
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We now comment on theorem [F] an d proposition 1271 This phenomenon is a consequence 
of the construction with exponential clocks, where the locations of the YpR dinucleotides 
coincide, before and after the superposition of the exponential clocks which represent the 
YpR mutations. Hence the overall frequency of these 4 dinucleotides should be the same 
for every values of the rates r*. Indeed the right hand side is the value of their overall 
frequency when = for every xy. Also, this can be checked directly from the equations 
which describe the equilibrium of these 4 dinucleotides. Finally, the product t^ty in the 
right hand side coincides with the product of the frequencies of the pyrimidines and of the 
purines, that is, the product 

F(Y) F(R) = (F(C) + F(T)) (F(A) + F(G)). 

Once again, this can be seen on the construction with exponential clocks. Property (a) 
means that one can fuse C and T into a single state Y (pyrimidine) , and G and A into a 
single state R (purine). The YpR substitutions have no effect on R and Y, hence the sites 
are i.i.d. Each pyrimidine mutates at rate 

SR ■= V A + W T + W C + V G , 

to a purine with probability (va + vg)/sr and to a pyrimidine otherwise. Each purine 
mutates at rate 

Sy := wa + vt + Vc + Wg, 

to a pyrimidine with probability (vt + vq)/ sy and to a purine otherwise. The net effect 
is that every pyrimidine becomes a purine at rate va + vq and that every purine becomes 
a pyrimidine at rate vq + vt, hence the stationary measure F(Y) := F(C) + F(T) of the 
pyrimidines is proportional to vq + vt and the stationary measure F(R) := F(A) + F(G) 
of the purines is proportional to va + vg- 

Part B Computation 

8 General case 

8.1 Polynucleotide frequencies 

From the construction given in part A, knowing the stationary distribution of the Markov 
process (X I (s)) s >o with I = {a — 1, . . . , b + 1} is enough to compute U. a ' b (/j,). Since 
(X l (s)) s >o lives on the state space A 1 , computing its stationary distribution amounts 
to solving a linear system of size #-4 I x #*4 : . Computing the equilibrium frequency of 
polynucleotides of length thus requires solving a A N+2 x A N+2 linear system. 

Theorem [Dl in section follows from these considerations and from Cramer's formula. 
However, even moderate lengths of polynucleotides (N = 4, say) lead to fairly large linear 
systems, so finding ways of lowering the dimension of the system to be solved is a critical 
issue, if solvability of the model is to be considered something more than a mere theoretical 
possibility. 

For single nucleotides and the restricted class of YpR dinucleotides, an autonomous linear 
subsystem can be isolated, and this is discussed in section l8~2l below. 

For general polynucleotides, symmetries can be used to reduce the computational burden. 
Using the invariance of (A I (s)) s >o with respect to translations on the discrete circle, the 
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linear system yielding the stationary distribution of (X I (s)) s >o can be reduced to a linear 
system of size 

m(N + 2) x m(N + 2), 

where m(k) denotes the number of distinct orbits in A k under translations. It is a well- 
known counting result, see e.g. jS], that 

d\k 

where <\> stands for Euler's indicator (the number of primes to a number). Hence, when 
k — > oo, 

m{k) ~ A k /k. 

This remark also achieves significant improvement for small values of N, see the table 
below. The last columns give the value of l/(N + 2) and the exact ratio m(N + 2)/4 7V+2 
of the sizes of the two linear systems. 



N 




m{N + 2) 


l/(N + 2) 


Ratio 


2 


256 


70 


25% 


27.3% 


3 


1024 


208 


20% 


20.3% 


4 


4096 


700 


16.7% 


17.1% 



Remark 28 One could think of the following alternative reduction to lower the size of the 
linear systems. From the results of part A, to find H a ' b with I = {a — 1, . . . , b+ 1}, 

it is sufficient to find the invariant distribution of the Markov chain 

(p(X I (a) _ 1 ),n ' 6 (X\s)) , v (X l ( s ) b+1 )) . 

This chain lives on a state space of size 4 ff x 3 2 . Hence, this remark reduces the size of 
the system by a factor 9/16 ~ 56%. On the other hand, the translation invariance is lost. 
All in all, using the translation invariance described above yields more effective reductions. 

Two approches to computing equilibrium frequencies of polynucleotides may be considered. 
One can solve the linear system numerically with fixed values of the parameters (with finite 
or infinite precision arithmetic), within reasonable time for N ^ 4. One can also solve this 
symbolically, a task that we performed only for N = 2, using a restricted version of the 
model possessing a single free parameter, and additional symmetries, see section fTTl 

8.2 Nucleotide and YpR dinucleotide frequencies 

Recall that F{x% ■ ■ ■ Xk), introduced formally in definition l23l in section denotes the sta- 
tionary frequency of the polynucleotide x± ■ ■ ■ In this section, we show how to compute 
F(x) for every nucleotide x, and F(xy) for every YpR dinucleotide xy. 

Definition 29 Introduce 

F(Y) := F(C) + F(T), F(R) := F{G) + F(A). 
Similar conventions are valid for polynucleotides, for instance 

F(YR):= E 

tt(x)=Y w(y)=R 
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Likewise, 

F(Yy) := F(Cy) + F(Ty), F(xR) := F(xA) + F{xG). 

We introduce some notations, related to the rates of the simple substitutions. 

Definition 30 For every nucleotide x, let s x denote the sum of the rates of mutations 
from x, hence sa '■= sr =: sq and sc '■= sy =: st, with 



SR ■= WA + VT + Vc + WQ, Sy ■= VA + WT + WC + V G- 



Likewise, let 



u, 



X X 

Finally, let tA '■= tR ='■ t G an d to '■= ty ='■ tT, with tR + ty = 1 and 

tR ■= (va + v G )/v, t Y := (v T + v c )/v. 

We turn to some notations related to the effect of the r v x substitutions on nucleotides. 

Definition 31 For every nucleotide x and every YpR dinucleotide yz, introduce p yz (x) as 
the rate at which x appears (or disappears if p yz (x) is negative) because of the dinucleotide 
substitutions associated to yz. Hence, 



Pcg(T) 
Pta{C) 
Pca(T) 
Ptg(A) 



r c 
r£ 



-Pcg(C), 
-Pta(T), 
-Pca(C), 
-Ptg(G), 



Pcg(A) 
Pta{G) 
Pca{G) 
Ptg{C) 



G 
r C 

G 
n G 

c 



-Pcg(G), 
-Pta(A), 
-Pca(A), 
-Ptg(T). 



Finally, for every x and every yz which is not a YpR dinucleotide, let 

Py Z (x) := 0. 

Equilibrium for the nucleotides yields the following relations. 
Proposition 32 For every nucleotide x, 

s x F(x) =v x -u x t x + y^Pyzjx) F(yz). 

yz 



Furthermore, F{R) = tR and F(Y) = ty . 

Note that the values of F(R) and F(Y) are independent of the YpR substitution rates r%. 
The underlying reason for this a priori surprising fact is in proposition |23 The proof of 
proposition E21 is in section FHTTH 

From proposition 02 the values of F{xy) for every YpR dinucleotide xy determine F(z) for 
every nucleotide z. To compute F{xy) for these 4 dinucleotides xy, we need some notations 
related to the effects of the r x mutations on dinucleotides. 

Definition 33 For every couple of YpR dinucleotides xy and zt, introduce p z t(xy) as the 
rate at which xy appears (or disappears if p z t(xy) is negative), due to the existence of the 
dinucleotide zt. Hence, assuming that {x,x*} = {C,T} and that {y, y*} = {A, G}, 



Pxy{xy) := -rj 



r x* ' 



Pxy{xy*) := rL, p xy (x*y) := r v x ,, p xy (x*y*) := 0. 
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For instance, 

p CG (CG) := -rg - r c A , PCA (CG) := rg, p TG (CG) := rg, p Tyl (CG) := 0. 
Finally, let 

Qxy ■■= -Pxyixy) = Ty, + 7^,. 

Equilibrium for the YpR dinucleotides yields the following relations. Recall the notations 
in definition l2T)l 

Proposition 34 For ewen/ YpR dinucleotide xy, 

(v + w) F(xy) + u x F{Yy) + u y F(xR) = v x F{y) + v y F{x) + ^p zt {xy) F{zt). 

zt 

The proof of proposition I3H is in section l8~3l 

From propositions and IH4| the 8 unknown frequencies we are looking for, namely the 
4 frequencies of the nucleotides and the 4 frequencies of the YpR dinucleotides, solve a 
system of 8 linear equations. One can show that the determinant of this system is not 
zero, hence the 8 frequencies are entirely determined by this system. 

A simpler way to proceed is to write the frequency of each nucleotide as an affine function of 
the 4 fequencies of YpR dinucleotides, then to plug these expressions in the 4 last equations. 
We state this as theorem IH1 below, which precises theorem IE1 in section [21 

Definition 35 Let F denote the 4x1 vector of the frequencies of the YpR dinucleotides, 
that is, 

( F(CG) \ 
F{CA) 
F(TG) 
V F(TA) J 



F := 



Theorem G (YpR frequencies) The YpR frequencies solve a linear system 

{{v + w)ld + V + W) - F = V, 

where the 4x4 matrices U and W and the 4x1 vector V are defined below and depend on 
the substitution rates v x , w x and r x . 

Definition 36 For every YpR dinucleotide, let 

v*:=v x /s R , v*:=v y /s Y . 

The matrix U is 



The coefficients of the matrix W are 

^xy,zt ■= ~Pzt{xy) - v*p zt (y) - v*pzt(x). 



1 u c + u G 


u G 


u c 


o \ 


UA 


U C + U A 





u c 


ut 





U T + U G 


UG 


V o 


ut 


U A 


U T + U A ) 
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Finally, the coefficients of the matrix V are 



Ux tx 



Note that every Y xy is positive. 

We now write the coefficients of W more explicitly. Each column W. tXy of W depends on 
the YpR rates of substitution through r^* and r y x , only, and through affine functions. More 
precisely, 

W IW = (l + «:)rJ. + (l+<)r^ 

Furthermore, 



W 



xy* ,xy 



[l + v*)r^+v* r r y x *, 



'x'y,xy 



and 



*^x*y* ,xy 



-v** r x * 

u x* ' y* 



* y 
Vy* r x * . 



For instance, the CG column of W is 



W. 



CG 



CG 
CA 
TG 
TA 



( r^ + rtf + v c r c A j s R + v G r%/ s Y \ 
-r c A - v c r c A js R + vaTt/sy 
~ r T ~ v G r^/s Y + v T r c A js R 
-v T r c A j s R - v A r%/s Y ) 



\ 



8.3 Proofs 

Proof of proposition 1321 

Assume that the distribution of (Xj)j e z is the stationary measure \i introduced in defini- 
tion!^ in section El By the definition of the dynamics, equilibrium for nucleotide x at site 
i reads 

s x F{Xi = x) = w x F{Xi = z)+ v x F(Xi = z) 

ir(z)=ir(x) n(z)j^n(x) 

+ Yl Py Z (x)nX i X i+1 =yz) 

ir(y)=ir(x)^iT(z) 

+ Y Pyz(x)F{X i ^ 1 X i = yz). 

n(y)^Tr(x)=Tr(z) 

Note that one of the last two sums in the expression above is always zero, which one 
depending on whether x is a purine or a pyrimidine. 

Using the translation invariance of the stationary distribution, and extracting constant 
factors from sums, this reduces to 

8 x F(x)F(yz)=Y,Pvz(x)+w x F ( z )+ V * J2 ^ 

y z tt(z)=tt(x) tt(z)j^tt(x) 

Using the fact that p yz (x) = —p yz (x*), and summing, on the one hand the above equi- 
librium equations for x = A and x = G, and on the other hand for x = C and x = T, 
we obtain a linear system of two equations involving F(R) and F(Y) = 1 — F(R). For 
instance, 

s R F(R) = {w A + w G ) F{R) + (v A + v G ) F(Y). 
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Solving this for F(R) and F(Y) yields the first assertion of the proposition, plugging these 
values into equation ([2j| yields the second assertion. □ 

Proof of proposition 1341 

As in the proof of proposition H121 above, we assume that the distribution of (Jfj)igz is the 
stationary measure /jl. Equilibrium for a YpR dinucleotide xy located at the pair of sites 
(i, i + 1) reads as the equality of the exit rate and the entrance rate. Both are due to single 
substitutions and to double substitutions. The exit rate of single substitutions has size 

(s x + s y )F(X(i)X(i + 1) = xy). 

The entrance rate due to the single transitions has size 

w x ¥{X{i)X(i + l) = zy)+ w y ¥(X(i)X(i + 1) = xt). 

tt(z)=tt(x) n(t)=ir(y) 

The entrance rate due to the single transversions has size 

v x P(X(i)X(i + l) = zy)+ ^ v y P(X(i)X(i + 1) = xt). 

Finally, the rate of double substitutions, counted as an entrance rate, has size 

Pzy(x)nX(i)X(i + l) = zy)+ Yl Pxt(y)nX(i)X(i + l)=xt). 

tt(z)=it(x) 7r(t)=7r(y) 

An important point to notice is that no YpR mutation affecting the pairs of sites {i — 1, i) 
or (i + l,i + 2) can occur when X(i)X(i + 1) = xy or lead to X{i)X(i + 1) = xy, since 
x is a pyrimidine and y is a purine. This fact rules out probabilities of trinucleotides 
from appearing in the above equation, which we could not avoid if xy was not an YpR 
dinucleotide. 

Using the facts that 

nX(i)X(i + l) = zy)=F(X(i + l)=y)- £ F(X(i)X(i + 1) = zy), 

tt(z)^tt(x) tt(z)=tt(x) 

and that 

P(X(i)X(i + 1) = xt) = P(X{i) =x)- F(X(i)X(i + 1) = xt), 

*"(*)^7r(2/) ir(t)=n(y) 

and the translation invariance of the stationary distribution, we obtain the identity stated 
in proposition 02J □ 



9 Uniform simple rates 

In this section, we study the influence of the rates r v x of YpR substitution rates, looking at 
the case when, for every nucleotide x, 

V „ = W r = 1. 
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Then, for every YpR dinucleotide xy, 

v = w = s x = Sy = 4, u x = 0, V xy = 1/2, v* x 



1/4. 



We assume, as a further simplification, that the YpR substitution rates from CpA and 
from TpG are zero, that is, 



r C 
r G 



4 = T% 



r T A = Q. 



Hence, W. jC A = W.,tg = 0, and 



/ hr c A + 5r^ \ 



,CG 



hr A + rq 
5r 



\ 



G 

T 
,G 



,TA 



'C 



G 

-5tq + r G 



G 



\ 5r£ + 5rg / 



Recall that qcc = r r + r A an d Qta 
Definition 37 Let 



A , T 
r C + r G- 



K := 32 + 5 (q C G + Qta) + 3 qcc Qta/^- 
Our following result gives the frequencies of the YpR dinucleotides. 

Proposition 38 Assume that the simple substitution rates are 1 and that no YpR substi- 
tution occur from CpA or TpG. Then, for every YpR dinucleotide xy, 



1_ ( k(xy) \ 



F(xy) = — I 1 + 
v yj 16 V K 



with 



k(CG) 


■= Qta - 


- 5 qcg - 3<7cg<?ta/4, 






k{CA) 


■= {$r c A 


-r£ )(l+3g rA /16) + (5r c - 




(l + 3q CG /W) 


k(TG) 


:= (5rf 


-7i)(l+39 TA /16)+(5r G ' 


-4) 


(l + 3gc7G/16) 


k(TA) 


■= qcc - 


- 5 QTA - SqcGqTA/'i- 







For instance, 



F(CG) 



32 + 6q TA 
16 K ' 



F(TA) 



32 + 6g CG 
16 K 



One can check that 



k[CG) + k(CA) + k(TG) + k(TA) = 0. 
The frequencies of the nucleotides follow. 

Corollary 39 In the setting of proposition 1^1 

32 + 6 qrA t 32 + 6 qcc 
~ r G ' 



AF{A) 


= 1 + 


AF(G) 


= 


4F(C) 


= 1-rf 


AF(T) 


= 1 + rf 



16 K 
32 + 6grA 

16 K 
32 + 6 

16 K 

32 + 6 95 

16 X 



16 K 

T 32 + 6 9cg 
+ ? G' 



+ r c 
T 

~~ r C 



16 K 



16 K 



16 K 
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If one assumes furthermore that = = 0, that is, that CpG is the only active 
dinucleotide, further simplifications occur. 

Proposition 40 Assume that the simple substitution rates are 1 and that no YpR substi- 
tution occur from CpA, TpG or TpA. Then, for every YpR dinucleotide xy, 

F(xy) = i ( 1 + Hxy) 



16 V 32 + 5(?f + r c A ) ) ' 
with 



k(CG) := -5(r^ + r£), fc(CA) := 5r^ - r£, 

and 

k{TG) := 5r$ - r% k(TA) : = rf + r£. 



For instance, 



1 32 + 4rg + 10r^ 



F(CA) = — 



16 32 + 5^ + 5^ 



Corollary 41 in i/ie setting of proposition \Ju\ 

F( X )= 1 -(l + ^ 

1 J 4 V 32 + h{r c A + rg) 

fc(T) := 2rf , k(C) := -2rf , k{A) := 2r% k{G) := -2r C A . 



10 Symmetric rates 

10.1 Models 

Consider property (b) below. 

Property (b) The substitution rates respect the complementarity of the nu- 
cleotides. 

This means, first, that the rate of substitution from x to y and from x* to y* coincide, for 
every nucleotides x and y, where we recall that the involution z i— > z* is defined by 

A* := T, T* := A, C* := G, G* := C. 

This means also that the rates of YpR substitutions from CG to CA and to TG coincide, 
and that the rates of YpR substitutions from TA to CA and to TG coincide, that is, 

C C AT 

r A = r T =: r w , r c = r G =: r s . 

This means finally that the rates of YpR substitutions from CA and from TG to CG 
coincide, and that the rates of YpR substitutions from CA and from TG to TA coincide 
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As regards the single substitutions, the most general model such that (a) and (b) hold is 
described by matrices 

/ ■ v w v s w s \ 

v w ■ ws v s 

vw ww ■ vs 
\ w w v w v s ■ J 

where vs, vw, ws and ww are nonnegative rates. For instance, every nucleotide A mutates 
to T at rate vw , to C at rate vs, and to G at rate ws- The indices W and S refer to 
the classification of nucleotides according to the strength of their link in double stranded 
DNA, the link between C and G being strong (S) and the link between A and T being 
weak (W). 

One recovers Tamura's matrix when ws vw = vsww- On the other hand, assuming 
that (b) holds, condition (a) corresponds to the additional requirements that the two 
substitution rates from a purine to C coincide, and that the two substitution rates from a 
purine to T coincide. 

As before, one can complete the matrix by diagonal elements, which represent fictitious 
rates of mutation from a nucleotide x to x, and which leave the whole process unchanged. 
The full matrix is 

/ w w v w v s w s \ 

vw ww ws vs 

vw ww ws vs 
\ w w v w v s w s J 



10.2 Frequencies 

From now on, we assume that properties (a) and (b) hold, that the only nonzero YpR rates 
are rg and rw defined above, and we compute the frequencies of the nucleotides and of 
the YpR dinucleotides. 

Definition 42 Introduce the parameters 

<rs :=vs + ws, o- W -=vw + ww, v :=vs + v w , w :=w s + w w , 

and 

a := as + o~w = + ho- 
using the notations in section [HI one gets 

u s + u w = vq- w , v = 2v , w = 2u> , s R = s Y = a. 
Using theorem [01 one gets 

/ F(CG) \ x / vsas 
Mx F(CA) =x v s a w + v w a s 
V F(TA) J V vwow 

where the 3x3 matrix M is 

(a (a + us) + {cr + vs) r w vus ~vsr s 

auw - (2vs + w )r w cr (3v + w ) a u s - (2v w + w ) r s 
-vwrw o-uw a (a + uw) + {cf + vw) r s 

From there, tedious computations lead to the following formulas. 
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Theorem H In the symmetric R/Y + YpR model described by the parameters vs, ws, 
vw, u>w, r s and r w, the frequencies of the YpR dinucleotides at equilibrium are 

F(xy) = D(xy)/(4D), 

where D and D(xy) are polynomial functions of the parameters (vs,ws,vw,ww, r s, r w), 
homogeneous of degree 3, and defined as follows. First, 

D := D + r s D s + r w D w + r s r w D S w, 

with 

D :=a 2 (a + 2v ), D sw ■= 2 {a + v ), 

and 

Ds := a {a + 2v ) +aws + v as, D w := a (a + 2v ) + aww + v^aw- 
As regards the dinucleotide CG, 

D{CG) :=D Q (CG)+rsD s (CG), 

with 

D (CG) := (a + 2v ) a 2 s , D S {CG) :=aw s + a s (v + 2v s ). 
As regards the dinucleotide TA, 

D{TA) := Dq(TA) + r w D W (TA), 

with 

D (TA) := (a + 2v ) cr^, D W (TA) := aw w + &w {vq + 2v w )- 
As regards the dinucleotide CA (and the dinucleotide TG), 

D{CA) := D (CA) + r s D S {CA) + r w D W (CA) + r s r w D SW (CA), 

with 

D (CA) := (a + 2v ) a s aw, D SW {CA) := a + wo, 

and 

D S {CA) := a s {a + 2v w ), D W {CA) := a w {a + 2v s ). 

Finally, 

aF{C) = a s /2-r w F{CG)+r s F{TA). 

One can check that, when rs = ry/ = 0, F{x) = Fq(x) and F(xy) = Fo(xy), where, for 
instance, 

F (CG) = F (C) F (G), F (C) =F (G) = ^. 

2 a 
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10.3 CpGo/e and TpAo/e ratios 

We come back to the original model of YpR substitutions, where CpG is the only active 
YpR dinucleotide, hence 

r s = 0. 

In this case, 

D (CG) 



F(CG) 



A{D + r w D w y 
and F(C) is given by the relation 

aF(C) =a s /2-r w F(CG). 

This is enough to get some information about the ratio of the observed and expected 
frequencies of CpG. 

Definition 43 Introduce 

CpGo/ e: . F < CG > 



F(C)F(G)- 

Proposition 44 (CpGo/e in a simple case) In the setting of theorem\Bl assume fur- 
thermore that rs = 0. Then CpGo/e ^ 1 for every rw, CpGo/e is a non increasing 
function of rw , CpGo/e — ► 1 when rw — ► 0, and CpGo/e — > when rw — > oo. Further- 
more, when rw —> 0, 

CpGo/e = 1 - rw ^cg + o(r^), 
where Kcg is positive and defined as 

_ (a + 3t; ) o-^y j- <7 ww 
CG ''~ a 2 (a + 2v ) 

Another quantity of interest is the ratio of the observed and expected frequencies of TpA. 

Definition 45 Introduce 



T P Ao/e:. F(TA) 



F(T) F(A) ' 

In the setting of theorem El and even if one assumes furthermore that rs = 0, the situation 
is less clear than for CpGo/e. For instance, one can show that, when rw — > 0, 

TpAo/e = l-r w K TA + o(r w ), 

where Kta is defined as 

K TA , LTA 



a 2 Wq (a + 2v y 
and 

L T a ■= awo~ s (o- + 2v ) + a w (a(o- + 2v ) + aww + vaw) 

—a 2 (aww + vaw) — 2a 2 awvw- 

The sign of Lta is difficult to decipher, in fact assume that there exists c such that 

ws = cvs, ww = cvw- 
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Then the expression of Lta reduces to 

L TA = craw ((3 + c)v w - (1 + c) v s ). 
This shows the following result. 

Proposition 46 (Values of TpAo/e) In the general setting of proposition \44[ both cases 
TpAo/e ^ 1 and TpAo/e ^ 1 are possible. However, when ws = ww and vs = v\y, 

TpAo/e 1. 

11 The simplest model 

In this section, we provide the values of the 16 frequencies of dinucleotides at equilibrium. 
To avoid awkward formulas, we consider the simple non trivial case. 

Definition 47 The simplest R/Y + YpR model is such that, for every nucleotide x, 

v x = w x = 1, 

and such that all YpR substitutions but those starting from CpG are excluded, i.e. 

r C = r A = r G = r T = r T = r A = Q) 

and such that the rates of CpG to CpA and CpG to TpG are equal, that is, 



In this section, we consider the simplest model. 
11.1 On symbolic resolutions 

Theoretically, one has to solve an appropriate linear system related to the dynamics on 
the discrete circle with N + 2 = 4 vertices, that is, of size 4^+2 = 256. The translation 
invariance yields a system of size m(N + 2) = 70, see section 18.11 Using the invariance 
with respect to both translations of the discrete circle and nucleotide complementarity, 
the size of the linear system to be solved is further reduced to 42. (This is because 14 of 
the 70 classes that the invariance by translations induces, are invariant by the nucleotide 
complementarity as well, the 56 other classes being grouped into pairs. We omit the details 
of this enumeration.) 

We solved this 42x42 system symbolically, using Maple™. We computed the full invariant 
distribution of (X l (s)) s >o with I := {1, 2, 3, 4} but we only give the equilibrium frequencies 
of dinucleotides because these are the quantities of greatest interest. Checking the results 
by human computations seemed prohibitively time-consuming and tedious but, to confirm 
the validity of the formulas, we performed some tests. In particular, we compared the 
formulas with the following. 

• The exact formulas obtained by human computations for YpR dinucleotides. 

• The results obtained by numerically solving the system with MATLAB® for various 
settings of the parameters. 
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• The results of extensive Monte-Carlo simulations, usually with 10 8 runs for each 
setting of the parameters. 

All these tests confirmed the values given below. 
11.2 Frequencies 

We first recall the values of the frequencies of nucleotides, deduced from previous sections. 

Definition 48 Introduce 

K (xy) = 4U(xy) + 2R{x) + Y(x) + R(y) + 2Y(y). 

where 

U = 1 TG + loA - 2 Icg, R = 1a-1g, Y = \t -Re- 
introduce ^ 

a(g) := , b(o) : = . 

w 96 + lV v ; 32 + 10e 

Definition 49 For every polynucleotide xx---Xk, define a function K(x\ ■ ■ ■ Xk) by the 
relation 

Theorem I (1) For every nucleotide x, K(x) does not depend on g, and 

K(x) = 2(R(x)+Y{x)). 
Hence K(A) = K{T) = 2 and K{C) = K(G) = -2. 

(2) For every YpR dinucleotide xy, K(xy) does not depend on g, and K(xy) = Ko(xy). 
Hence 

K(CG) = -10, K(CA) = K(TG) = 4, K(TA) = 2. 

(3) For every dinucleotide, K{xy) — > Ko(xy) when g — > 0. 

Part (3) reads as 

K (GG) = K (CC) = -3, K (TT) = K (AA) = 3, 
K (AG) = K (CT) = 1, K (TC) = K (GA) = -1, 

and 

K (AC) = K (GT) = 0, K (AT) = 4, K (GC) = -4. 
Here is a consequence of theorem 

Proposition 50 The nucleotides C and G are always less frequent than A and T. More 
precisely, for every positive g, 

20% ^ P(C) = P(G) < 25% < P(A) = P(T) ^ 30%. 
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Furthermore, the dinucleotides CG and TA are repulsive and the dinucleotides CA and 
TG are attractive, in the sense that 

F(CG) ^ F(C) F(G), F(TA) < F(T) F(A), 

and 

F(CA) ^ F(C) F(A), F(TG) > F(T) F(G). 

Definition 51 For every dinucleotide xy, introduce K\(xy) as 

K(xy) =: K (xy) + gK^xy). 

Theorem J For every dinucleotide xy, K\(xy) assumes one of the five values 0, ±a(g), 
and ±b(g). More precisely, 

Ki{xy) = a(g) (R{x) R(y) + Y(x) Y{y)) + b{g) R(x) Y(y). 

As regards, for instance, the dinucleotides Ax, this means that 

K\ (AA) = a{g), K^AC) = -b(g), K^AG) = -a{g), K\ (AT) = b(g). 
Going back to frequencies, this reads as 

nAA) = 1(i + _jl_( 3+ 3e 



16 V 32 + 10p V 96 + 19£ 



16 V 32 + Wg V 32 + W Q 



16 V 32 + Wg V 96 + 19 Q 
F(AT) = ! (l+ U + Ae 



16 V 32 + 10p V 32 + 10j? 
Similar formulas are available for the 12 other dinucleotides. 



11.3 Remarks 

One sees that, when g — > oo, gK\(xy) converges to a nondegenerate limit. 
Corollary 52 When g — > oo, K(xy) converges to K OQ (xy), with 

K^xy) := K (xy) + A ( R ( x ) R{y ) + Y (x) Y(y)) + | R(x) Y(y). 

This means that 

1 (. K, 



F(xy) - F^xy) := ^ ( 1 + ^M] 



16 V 32 J' 

For instance, 

F(C) = F(G) - 1 F(A) = F(T) - A 

and 

F(CG) - 0, F(CA) = F(TG) - ^, F(TA) - A. 
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When g = 0, one recovers the i.i.d. evolution, hence F(x± ■ ■ ■ x^) = l/4 fc for every polynu- 
cleotide X\ - ■ ■ Xk- 

The model makes sense for every g ^ — 1. For instance, g = — 1 forbids the transitions 
from CG to TG or CA, but allows the transitions from CG to AG, GG, CC and CT. 
Although the model does not make sense when g < — 1, the frequencies that we are able 
to compute are all positive, even formally for some values of g < —1. It may happen that, 
as soon as g < — 1, the expressions of the frequencies of some polynucleotides are indeed 
negative. In any case, when g = — 1, 

F(C) = F(G) = ^ F(A) = F(T) = ±. 
12 Continuous dynamics 

In this section, we show that one can study the evolution of the YpR frequencies, using 
essentially the techniques used to describe the statics. Consider for instance the simplest 
model, see definition IT71 For every YpR dinucleotide xy, the frequency F(xy)(s) of xy at 
time s satisfies 

— F(xy)(s) = F(x)(s)+F(y)(s)-8F(xy)(s) + 

+ gF(CG)(s) (1 TG + 1ca){xv) ~ 2gF{CG)(s) l CG {xy). 

Likewise, for every nucleotide x, 

— F{x){s) = -4F(x)(s) + 1 + e(x) gF(CG)(s), (3) 

where e(C) = e(G) = +1 and e{A) = e(T) = —1. Hence, F(CG)(s) satisfies the second 
order evolution equation 

F(CG)(s) + 2{g + 6)^- F{CG)(s) + 2 (16 + 5g) F(CG){s) = 2. 
as z as 

One can check that the two roots of the characteristic polynomial of this linear differential 
equation have negative real parts. Hence F(CG)(s) indeed converges when s — > oo to the 
fixed point 1/(16 + 5g) of the equation, that is, to the stationary value F{CG). Plugging 
this into © for every value of x yields the convergence of AF(x)(s), when s — > oo, to the 
value 

l + e{x)gF{CG) =4F(x). 



Part C Simulation 

13 Coupling from the past 

A consequence of the construction of the previous sections is that we can simulate the 
restriction of the dynamics on Z to any finite interval of sites of length n, without truncation 
errors due to neglecting the influence of remote sites. One adds a site to the left and a 
site to the right, one performs simulations for the system on these n + 2 sites, and the 
projection on the n original sites yields the desired simulation. 
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In this section, we show how the coupling from the past (CFTP) methodology of Propp 
and Wilson can be applied in our context. Our motivation is two-fold. First, estimates 
about the coupling times automatically yield estimates on the speed of convergence of 
the dynamics to the stationary distribution. In our context, this applies to the speed of 
convergence of Ii a,b (X s ) to U a,b (fi). Second, the CFTP technique allows to sample exactly 
from H a,b (/j,). Despite the results of the previous sections, which show that the obtention of 
exact expressions of U a,b ([i) amounts to the inversion of a linear system, this task becomes 
computationally infeasible as soon as the number b — a + 1 of sites is large, say greater than 
6. Hence, Monte-Carlo simulations are still useful, if only to confirm the results obtained 
by inverting the linear system! 

In the whole section, we fix I = {a, ... , b}. 
13.1 Coupling events 

We first define the notions of coupling events and locked sites. 

Definition 53 (Coupling events) We say that a coupling event occurs at site i and times 
(si, s 2 , S3) if the following assertions hold. 

• s\> S2 and S3 > S2, 

• ~*i belongs to Ufa U Ufa U V,f i} U Vfa, 

• "S3 belongs to Ufa UUfa U Vfa U Vfa, 

• — S2 belongs to \J z( z^Uf , 

• (-si, -s 2 ) n \JzeA u \ z (i) u v \\i) %s em pty, 

• (-s 3 , -s 2 ) n \J zeA Ufa u V r(t) ls empty. 

Definition 54 (Locked sites) We say that the site i is locked at times (u, v) if, for every 
times a and s such that a ^ — u and s ^ —v, the set 

* (T (^4. I ) ^i,«) 

contains but one single element. In other words, § a (x,£,i,s) = ^ a (x', £, i, s) for every 
initial configurations x and x' in A ■ 

Recall that <3? CT is introduced in sectional Definitions 023 and |^] both involve I, through the 
definitions of l(-) and r(-). Our next proposition shows that the notions of coupling event 
and locked site are related. The proofs of the results in section El are in section HH.41 

Proposition 55 If a coupling event occurs at site i and times (si, S2,ss), then i is locked 
at times (s^,S2), with S4 := max(si,S3). 

In words, for every initial condition imposed before time — max(si, S3), the ith coordinate 
is the same after time — s 2 - A consequence is that all the trajectories of the process have 
coalesced as far as site i is concerned. 
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13.2 Locking times 

We wish to estimate, or at least to control, the time that is needed to lock a given collection 
of sites. We start with one site. 

Definition 56 (Locking times) Let Tj be defined as the supremum of the times S4 such 
that there exists si, S2 and S3, with the following properties: S2 ^ 0, S4 = max(si, S3), and 
a coupling event occurs at site i and at times (si, 32,53). 

We now define additional numerical parameters. 

Definition 57 (Combined rates) Introduce 

k r := ^ mm(c z ,v z ), n Y ■= ^ mm(c z ,v z ), k:=k r + k y , 
z=A,G z=T,C 

and 

v R := ^2{v z -c z ) + , v Y : = ^ (v z - c z ) + , v:=v R + v Y . 

z=A,G z=T,C 

Note that kr + vr = va + vq and ny + vy = vc + vr- 
Definition 58 Let 

{v A + vq) x {v c + v T ) 

Oi '■= tylR = — -75-. 

(VA + Vq + Vc + VtY 

These parameters allow us to control the distribution of the locking times, as follows. 

Proposition 59 Each locking time T% is stochastically dominated by the random variable 

H 1 +-- + H z , 

where {Hk)k^i is a sequence of i.i.d. Gamma (3, k) random variables, and Z ^ 1 is a 
geometric random variable with parameter a, independent from (-fffc)fc^i- In particular, Tj 
is almost surely finite and E(Tj) ^ 3/(kck). 

We now proceed to bound the tail of Tj. For each k ^ 1, the distribution of H± + . . . + 
is Gamma (3k, k). 

Definition 60 Introduce n a = — 41og(l — a) /a, hence n a is finite and n a ^ 4. 

Lemma 61 For every nonnegative integer N, 

P(#i + • • • + H z > Nn a /n) < 2 (1 - a)*. 

Our interest lies in the coupling time of whole intervals, defined below. 

Definition 62 (Locking times of intervals) The locking time T a ^ of the sites in the 
interior of the interval I = {a, . . . , b} is 

T a)b = max{Ti ; a + 1 < i < b - 1}. 
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Definition 63 Introduce the integer 

[ 6-0-1 

Ka,b — o 

Proposition 64 For every integer N, 

Q(T atb > Nn a / K ) < 3 [l - (1 - 2 (1 - a) N )' 

Remark 65 The bound above has the nice property that it does not involve the rates r v z of 
the YpR mutations except, through the c z , when they are negative. 

More readable forms of proposition IMl might be proposition IH71 and corollary IM1 below. 
Definition 66 Let 

t a ,b = 41og(6fc a)6 ). 

Proposition 67 (Control upon the locking times) For every t, 
Q((a«) T a , b > t a , b - log(l - a) + 1) < exp(-t/4). 

Definition 68 Let TL) denote the locking time of n consecutive sites. 

For instance T( n ) is distributed as To in +i. Note that 3&o,n+i ^ n + 2 and that a ^ ^, hence 
4 log 2 — log(l — a) ^ 6 log 2. This yields the following corollary. 

Corollary 69 For every t, 

Q((a K )T (n) ^log(n + 2) + 61og(2)+i) < exp(-t/4). 

13.3 Consequences 

The now traditional Propp- Wilson method induces that, whatever the initial condition x 
in .4 1 with I = {a, . . . , b}, for every J C I, 



, where T = max T~ , 



is distributed according to the projection of the stationary distribution m on A J . In 
particular, the distribution of 

is n a+1 ' 6-1 (fJ, a> b) , that is, by proposition n a+1,6_1 ( / u). We state this as a proposition. 
Proposition 70 For every a O, the distribution ofIl a > b fe^^ +1 (xi0( )) is n a,fc (//). 
Hence proposition RT71 yields the result below. 

Proposition 71 For every positive t, the distance in total variation between the distribu- 
tion U a ' b (Xf) at time 

s = (* -l,6+l - log(l - a) + t)/ {oik), 
and the limiting distribution H a ' b (fi), is at most exp(— t/4). 
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13.4 Proofs 



Proof of proposition 1551 

We shall in fact prove the following assertion: assume that site i is locked at times 
(si, 52,53), then, for every a ^ — S4 and s ^ — s 2 , the sets 

are all singletons. 

We first check the claim when s = —s 2 . By the definition of si, at time — si, either a 
move of type U occurs at site yielding an A or a G unconditionally, or a move of type 
V occurs, namely a transversion to a purine, yielding a purine if site i was not already 
occupied by a purine. As a consequence, the set 

is a singleton. Once again by the definitions, we ruled out the possibility that any move of 
type U or V occurred at site l(i) between the times — si and — s 2 . Furthermore, moves of 
type W, R and Q, when applied to a purine, can only yield a (possibly different) purine. 
This implies that the set 

Q (<Z> a (A\^\(i),-s 2 )) 

is a singleton as well. The same argument applies symmetrically to ^(^^(A 1 , £, r(i), — S2)). 

As regards ^ a (A 1 , —82), this is a singleton since a move of type U occurs at site i at 
time — S2- Lemma [TTfl above shows that, for every x in A 1 , the values of 

e($ a (x,£,l(i),s)), * ff (x,£,i,s), J7(* ff (x,£,r(i),s)), 

for every s ^ — S2, are completely determined by ^ and by the values of 

Q($ a (x,t,l(i),-S2)), ^(x^.i.-aa), ^(*<t(x,C, r(»), -s 2 ))- 

Since these values are the same for every x in A 1 , so is the case for ^( ( I> (J (-, £, l(i), s)), 
£, i, s), and ry($ CT (-, £, r(i), s)), for every s ^ — s 2 . This concludes the proof. □ 

Proof of proposition 1591 

Recall the convention that sup0 = —00. Define Mq := 0, and, inductively for k ^ 1, 

• -L k := sup (-00, -M fc _i) n (U e ^f), 

. -C4 := sup (-00, -L k ) n (lUx^ U V iw) . 

. -F fc := sup (-00, -L k ) n (U^«* (i) u V* (i) ) , 

• -Mjfe := -max(U k , V k ). 

Define K as the smallest integer k ^ 1 such that 

belongs to Ufa U Wg } U V,f i} U Vjf f) , 

and 

-V k belongs to Wg } U Uj {i) U v£ } U Vj {i) . 
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Then, provided that K is finite, a coupling event occurs at site i and times (Uk, Lk, Vk), 
hence Tj ^ Mk as soon as K is finite. Furthermore, standard properties of Poisson 
processes and the independence of the Poisson processes that are associated to different 
sites show that the sequence 



(L k — M k _i, U k - L k , V k - L k 



'k>l 



is i.i.d., and that, for every given k ^ 1, L k — M k -\, U k — L k and V k — L k are mutually 
independent and exponentially distributed with parameters k, k + v, and k + v respectively. 
Finally, K ^ 1 is independent from (L k — M k _i, U k — L k , V k — L k ) k ^i, and geometrically 
distributed with parameter a and expectation 1/a. Writing 

M k - M fc _i = L k - M k ; + max(£/ fc - L k , V k - L k ), 

and recalling that Tj ^ Mk, one gets 

E(Ti) < - ( - + 



a \k 2(k + v) 

Simpler upper bounds obtain as follows. Since k + v ^ k, the distribution of the random 
variable max(U k — L k , V k — L k ) is (crudely) dominated by the distribution of the sum of 
two independent k exponential random variables, hence the distribution of Tj is dominated 
by the distribution of the sum of three independent k exponential random variables. □ 

One sees that 



2an 

Proof of lemma l6lt 

By the homogeneity of the Gamma distributions, we assume that K = 1. The nonnegativity 
of the random variables (H k ) k implies that, for every integer k and every real number 
t > 0, 

F(Ht + --- + H z ^t)^P(Z^k + l)+ P(i?! + ... + H k ^t). 

The first term on the right hand side is (1 — a) k . By Cramer's bound and the value of the 
Laplace transform of the standard exponential distribution, evaluated at ^ u < 1, the 
second term is at most 

(1 - u)~ 3k exp(-ut). 

Assume that t = n a N for an integer N, and choose u = a and k = N. Then the proof is 
complete, since for these values, 

(1 - af = (1 - u)~ 3k exp(-ui) = (1 - a) N . 

□ 



Proof of proposition 1641 

Write I = Jq U I\ U I2, where Ij collects the sites in I that are equal to j modulo 3. For 
j = 0, 1 and 2, let = max{Tj ; i € Ij}. 

By the independence properties of the collection (Tj)j, for each j, the random variables 
(Ti)i G j j are i.i.d. Furthermore, for each j, Tu\ involves at most k a j> sites in I. Hence, for 
every nonnegative t, 

Q(T (i) ^ t) < 1 - (1 - Q(Tj ^ t)) k ^. 
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Since T a ^ is the maximum of the three random variables 

Q{T a ,b >t)^ Q(T (0 ) > t) + Q(T (1) > i) + Q(T (2) ^ t). 
One concludes, using the upper bound of Q(Tj ^ i) in lemma I6T1 above. □ 

14 Practical issues 

This section is devoted to some practical issues related to an effective implementation of 
the CFTP method of simulation of R/Y+YpR systems. We consider two slightly different 
versions of the method and we give short presentations of both. The key point of each 
version is that an efficient detection of the coalescence is at hand. For the sake of readability, 
we do not provide detailed pseudo-codes but only the basic schemes used to implement the 
methods. 

14.1 Additional notations 

Fix £ in fii and x in j} with I = {a, . . . , b}. Let 

T' = (-oo,o)nUUwf(£)uvf(6, 

and 

t" = (-oo,o) n (J |J w?(0unm u am. 

Let t'_ 1 = and (— 1' , —t[, • • • ) denote the ordered list of points in T', that is: 

T' = {-t' Q , —t[, —t' 2 , ■ ■ ■} with 0> -t' > -t[> ■■■ . 

For every n ^ 0, we describe the move that occurs at time t' n through the site c' n where 
the move occurs and through the description M' n = (z' n , /4) of the move, as defined above 
in sections 15.21 and 15.31 

For every n ^ 0, let N n denote the number of points of T" in the interval (— t' n , — t' n _i). 
When N n ^ 1, let 

T" n (— t' n , —t' n _i) = {—t'n,l < ■ ■ ■ < —t'n,N„} ■ 

Finally, we describe the move which occurs at time t" n k through the site k where the 
move occurs and through the description M" k = (z'^ k ,f^ k ) of the move, as above. 

14.2 First algorithm 

We describe a routine which yields a random element of J[^ a+1 ''"' b ~ 1 ^ with distribution 

n«+i.&-i( M ). 

*** Coalescence detection *** 
Let n := 0; 

Until for every i in {a + 1, . . . , b — 1} , Tj ^ t' n _ 1 , do : 
{ Generate and store M' n and c' n ; Let n:=n + l; }; 
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*** Sampling *** 

Let x := (A, ... ,A); 

For k going backwards from n — 1 to do: 

{ Let x := 1 {x,M' k ,c' k ); 
Generate N^; 

For m going from 1 to Nk do: 

{ Generate M£ m and c£ m ; Let x := 7 (x, M£ ro , c£J ; } } 
Return ir*- 1 ' 6-1 ^ . 

The feasibility of the above routine relies on several facts. First, it is easy to generate 
realizations of the random variables c' n and M' n , whose distributions are explicitly known 
and standard. Second, one can check whether or not ^ t' n , knowing only the sequence 
c' k and M' k for k between and n, and this can be done in a step-by-step way, updating 
information about sites as time goes backwards and new moves are introduced. 

One advantage of this method is that one does not have to generate the random variables 
M' kml c km on the coalescence detection pass nor to store them, but just to compute their 
effect on x in a step-by-step way during the sampling pass. This helps keeping memory 
storage requirements and execution time to a minimum. 

14.3 Second algorithm 

We need some additional definitions, because the detection of coalescence devised in this 
second algorithm uses a slightly different technique from the one in the first algorithm. 

Definition 72 (Coupling event on an interval) Let J C I. A coupling event of type 
(si, S2, S3, J) occurs at site i when the following conditions hold: 

• s\ > S2 and S3 > S2, 

. - Sl G Ufa U Ufa U Vfa U Vfa or l(<) € J, 

• -*3 G Wg) U Uj^ U V? {i) U or r(i) G J, 

• (-si, -s 2 ) n Uz6-4 W l(i) U V \\i) is ^Pty or '(«) G J ^ 

• ("S3, -S2) n \JzeA U r(i) n V r Z (i) * S em ^^ 0r r (*) E J ' 

The key observation is the following: assume that a coupling event of type (si, S2, S3, J) 
occurs at site i and that every site c in J is locked at times (max(si, S3), S2). Then, the 
site i is locked at times (max(si, S3), S2), as well. 

Our second routine yields a random element of J({ a+1 '-' b ~ 1 } with distribution H a+1 ' b ~ 1 (p,). 
We make use of a map r : N — > N such that 

• r(0) = 0, 

• r(k + 1) ^ r(k) + 1 for every k ^ 0. 
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*** Coalescence detection *** 
Let n := 0; 

Until J = {a + 1, . . . ,b - 1} do: 
{ Let J := 0; 

Generate and store M' k and c' k for k := r(n), . . . ,r(n + 1) — 1; 
For £ going backwards from r(n) to do: 
{ If a coupling event of type (si,t' e , S3, J) occurs at site c' e , with 
s%, S3 ^ t'/\ , and if c' e is not already in J, then let J := J U {c^} ; } 
Let n := n + 1 ; } 
*** Sampling *** 
Let x := (A, . . . , A); 

For k going backwards from r(n) — 1 to do: 
{ Let x :=j(x,M^,c' k ); 
Generate Nf.; 

For m going from 1 to do: 

{ Generate M' k \ m and d' km ; Let x := 7 (x, M£ m , c£J ; } } 
Return n 4 * 1 ' 6-1 (sc) - 

As before, the feasibility of this routine relies first on the fact that the various random 
quantities can be easily simulated when needed by the algorithm. The second key point is 
that, going forward from time — t' T ( n \ to time 0, one can easily detect the coupling events 
of type (si,t' £ , S3, J) with si and S3 ^ ^ T ( n y simply by recording for each site i the latest 
(with time going forward) move that affected this site before the current time —t' £ . The 
second pass of the routine, that is, computing x once the coalescence has been obtained, 
is the same as in the first method. 

Despite the fact that this second routine may use several passes to detect coalescence, 
instead of a single one as in the first routine, the use of sites that are already locked, to 
detect coupling events, makes it more effective in some cases, the coalescence time being 
always smaller than in the first routine. The choice of the updating policy contained in 
the function r(-) is still subject to empirical adjustment. The aim here is to reduce the 
number of passes as much as possible, while keeping each pass not too time-consuming. 

Remark 73 In either algorithm, one never generates the ringing times which govern the 
dynamics themselves, but only the sequence of moves that they induce. 

14.4 A special case 

It is possible to improve upon the previous results in a special case, namely when the only 
YpR substitutions are from CpG and when the rates of substitutions are such that 

w c = v c , w G = v G . 

Throughout this section, we assume that these additional assumptions hold, and we only 
state the relevant results, since the proofs closely parallel those in the general case. 

The key observation is the following modification of lemma which allows for the use 
of coarser quotients of the state space A. Recall that l x (x) := 1 and that l x {y) '■= for 
every y 7^ x. 

Fix I = {a, ... ,b}. 
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Lemma 74 For every site i in I and every time s a, the functions 

3-C [p\( x > «)] , <fl(x, s )> !g [p* (z, £, r W> s )] ) 

are measurable with respect to the following initial conditions and source of moves: 
lcOl(i)), Xi, l G (x r (i-)), Z\(i)[<T, s], £i[(T,s], £ r (i)[cr,s]. 

This lemma suggests a modified definition of coupling events. 

Definition 75 (Modified coupling events) A modified coupling event occurs at site i 
and times (si,S2,ss) if: 

• si > S2 and S3 > S2, 

• -*i belongs to Ufa U Ufa U Z^ f) U V,f. } U Vfa, 

• "S3 &e/on ffS to U«J U V£ } U Vj i} , 

• — S2 belongs to {J z( z^Uf , 

• (-si, -s 2 ) n LU^ifi) U is empty, 

• (-S3, -S2) n \J z eA U r(i) U V r« is empto. 

The following property is the analogue of proposition 1551 

Proposition 76 // a modified coupling event occurs at site i and times (si, 82,83), then i 
is locked at times (34,82), with S4 = max(si,S3). 

Definition 77 (Modified coupling times) The modified coupling time T, of site i is the 
supremum of the times S4 such that there exists s\, S2 and S3, with the following properties: 
S2 ; S4 = max(si,S3) ; and a modified coupling event occurs at site i and at times 
(si,s 2 ,s 3 ). 

The modified coupling time T a ^ of the interval I = {a, . . . , b} concerns the sites in the 
interior of I, namely 

T 0) (, = max{Tj ; a+l^i^b — 1}. 

The definitions below mimick definitions l58l and l6f)l 
Definition 78 (Modified combined rates) Introduce 

KAGT = ^2 min ( c 2^2), KACT = ^2 m[n ( c z,V z )- 

z=A,G,T z=A,C,T 

and 

~ KAGT + Vr KACT + VY 
a = x . 

K + V K + V 

The arguments in section U.S. II yield the following estimates. 
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Proposition 79 For every integer N , 



\ b > iVn 5 / K ) < 3 [l - (1 - 2 (1 - a) N ) K ' b 



For every t, 



an) T a , b > t„ b - log(l - a) + 1) < exp(-t/4). 



Since a ^ a, the estimates below compare favorably with those obtained in section fl3.il 

To comply with modified coupling events, the two methods of practical detection of the 
coalescence, described in sections 114,21 and 114. 3| can be modified in a straightforward way. 
This yields a priori shorter coalescence times, but the effective magnitude of this gain 
should be evaluated, relying on concrete cases. 



15 Index of notations 

Nucleotides 

A={A,C,G,T} 

A* = G,T* = C, C* = T, G* = A 
ir(A) = tt(G) = R, 7r(C) = tt(T) = Y 

p(A) = p(G) = R, p(C) = C, p(T) = T (the application p defined on A is not to be 
confused with the real number g, see below) 

77(C) = V (T) = Y, r,(A) = A, V (G) = G 

Rates of substitutions 

v x : rate of the transversions to x 
w x : rate of the transition to x 

r% : rate of the substitution from yx* to yx when yx is a YpR dinucleotide; rate of the 
substitution from x*y to xy when xy is a YpR dinucleotide 

g : rate of the substitutions CG — > CA and CG — > TG when these are the only double 
substitutions and their rates coincide (the real number g is not to be confused with 
the application p defined on A, see above) 

Functionals of the rates of substitutions 

u x = v x — w x 

V = VA + VT + Vq + Vq, w = WA + wt + wc + Wq 

t Y = {v c + vt)/v, t R = (v A + v G )/v 

SA = Sg = Sr = VA + Wt + Wc + Vg, St = Sc = Sy = WA + Vt + Vc + Wg 
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v* A = v A /s Y , = v T /s R , v* c = v c /s R , v* G = v G /s Y 

kr = min(cA, va) + mm(c G ,v G ), ky = mm(c T ,v T ) + mm(c c ,v c ), k = k r + k y 

vr = (ca - v A ) + + (c G ~ v G ) + , vy = (or - v T ) + + (c c - v c ) + , V = VR + V Y 

ot = tRty , n a = — 41og(l — a) /a 

KAGT = kr + min(c T , vr), k A ct = k y + mm(c A , v A ) 

a = {k A gt + vr) (k AC t + v y )/(k + v) 2 

Rates of substitutions in the S/W symmetric case 

vw = va = vt when va = vt, vs = v c = V G when vc = vg 
ww = wa = wt when wa = wt, ws = w c = W G when wc = wg 

rw : rate of the substitutions CG — > CA and CG — ► TG when these two rates coincide 
rs : rate of the substitutions TA — > CA and TA — > TG when these two rates coincide 
OS =v s + ws, cr s = v w + w w 
vq = vs + t>w, w = ws + ww 
a = vq + wq = as + <tw 

Sets of sites 
1 = {a,...,b} or I = Z 

II' ' - ' 6 } : projection on the {o, . . . , 6} coordinates 
l(i) and r(i) : left and right neighbors of the site i in I 
'b-a-l 



t a b = 6 log k a h 
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